If you do not know what tSNE is, how it works, and have not read the original revolutionary van der Maaten & Hinton paper from 2008, you probably do not need to know, because tSNE is basically dead by now. Although tSNE made a dramatic impact on single cell genomics and Data Science in general, it is widely recognized to have a few disadvantages that will have to be fixed sooner or later.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/HowUMAPWorks/tSNE_is_Dead.png', width=2000)
What is it exactly that makes us uncomfortable using tSNE for single cell genomics? Here I summarize a few points with short comments:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/HowUMAPWorks/tSNE_MNIST_Clusters.png', width=2000)
tSNE is a relatively simple Machine Learning algorithm which can be summarized by the following four equations:
$$ p_{j|i} = \frac{\exp \left ( - || x_i - x_j || ^2 \big / 2 \sigma_i^2 \right ) }{\sum_{k \neq i} \exp \left ( - || x_i - x_k || ^2 \big / 2 \sigma_i^2 \right )}, \hspace{2em} p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N} \hspace{2em} (1) $$

$$ \rm{Perplexity} = \displaystyle 2^{\displaystyle -\sum_j p_{j|i} \log_2 p_{j|i}} \hspace{2em} (2) $$

$$ q_{ij} = \frac{ \left ( 1 + || y_i - y_j || ^2 \right ) ^ {-1} }{\sum_{k \neq l} \left ( 1 + || y_k - y_l || ^2 \right ) ^ {-1} } \hspace{2em} (3) $$

$$ KL(P || Q) = \sum_i \sum_j p_{ij} \log \frac {p_{ij}} {q_{ij}}, \hspace{2em} \frac{\partial KL}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j) \left ( 1 + || y_i - y_j || ^2 \right ) ^ {-1} \hspace{2em} (4) $$

Eq. (1) defines the Gaussian probability of observing distances between any two points in the high-dimensional space, together with the symmetrization condition that makes the probabilities $p_{ij}$ symmetric. Eq. (2) introduces the concept of Perplexity as a constraint that determines the optimal $\sigma_i$ for each sample. Eq. (3) declares the Student t-distribution for the distances between pairs of points in the low-dimensional embedding; the heavy tails of the Student t-distribution are there to overcome the Crowding Problem when embedding into low dimensions. Eq. (4) gives the Kullback-Leibler divergence loss function, which penalizes the mismatch between the high- and low-dimensional probabilities, and the analytical form of its gradient to be used in the Gradient Descent optimization.
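As a quick illustration of Eq. (2) (a sketch, not part of the pipeline below): for a probability distribution that is uniform over k neighbors, the entropy is $\log_2 k$ and the perplexity is exactly k, which is why perplexity is often read as the effective number of nearest neighbors each point "feels":

import numpy as np

def perplexity_of(prob):
    """Perplexity = 2^entropy of a discrete probability distribution."""
    p = prob[prob > 0]
    return 2 ** (-np.sum(p * np.log2(p)))

# Uniform distribution over k = 10 neighbors -> perplexity is exactly 10
k = 10
print(perplexity_of(np.ones(k) / k))   # 10.0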
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/HowUMAPWorks/Gauss_vs_Student.png', width=2000)
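If the image above is not available, the comparison is easy to reproduce; a minimal sketch plotting the two kernels (unnormalized, with one degree of freedom for the Student t-distribution):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 500)
gauss = np.exp(-x**2 / 2)       # Gaussian kernel (up to normalization)
student = 1 / (1 + x**2)        # Student t-kernel, one degree of freedom

plt.plot(x, gauss, label='Gaussian')
plt.plot(x, student, label='Student t (1 d.o.f.)')
plt.xlabel('Distance'); plt.ylabel('Unnormalized probability')
plt.legend(); plt.show()

The heavy tails of the Student t-curve are clearly visible at large distances.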
Since tSNE is a really simple algorithm, let us try to program tSNE in Python from scratch.

As a test data set, we will be using the Cancer Associated Fibroblasts (CAFs) scRNAseq data. We start by importing the necessary Python libraries (mainly numpy and scikit-learn will be used), having a look at the data matrix and checking the dimensions of the data set. Please note that the rows are cells and the columns are genes here; the last column contains the coding of the clustering results, i.e. each cell belongs to cluster # 1, 2, 3 or 4:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances

path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])

# All columns except the last one are gene expression values
X_train = expr.values[:,0:(expr.shape[1]-1)]
# Log-transform to stabilize the variance of the counts
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
# The last column encodes the cluster label of each cell
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
One important global variable to define from the very beginning is the matrix of squared pairwise Euclidean distances for the initial high-dimensional scRNAseq data set; this matrix will be used a lot in the code that follows:
dist = np.square(euclidean_distances(X_train, X_train))
dist[0:4, 0:4]
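As a quick sanity check (my addition, not required by the pipeline), the matrix should be symmetric, with zeros on the diagonal:

# The squared distance matrix must be symmetric with zeros on the diagonal
assert np.allclose(dist, dist.T)
assert np.allclose(np.diag(dist), 0)
print(dist.shape)   # (n, n)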
Next, using the matrix of squared pairwise Euclidean distances, we will compute the matrix of probabilities in the high-dimensional space from Eq. (1). Knowing the matrix, we can easily calculate the entropy and the perplexity as $2^{\rm{Entropy}}$. Please note the sigma argument of the function. This implies that the matrix of probabilities, as a function of $\sigma$, will be used later in the binary search procedure that computes the optimal sigmas for a fixed perplexity.
def prob_high_dim(sigma, dist_row):
    """
    For each row of Euclidean distance matrix (dist_row) compute
    probability in high dimensions (1D array)
    """
    exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
    exp_distance[dist_row] = 0   # a point is not its own neighbor: p_i|i = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    #prob_symmetr = (prob_not_symmetr + prob_not_symmetr.T) / (2*n_samples)
    return prob_not_symmetr

def perplexity(prob):
    """
    Compute perplexity (scalar) for each 1D array of high-dimensional probability
    """
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))
A trick here, for convenience later on, is that we compute a 1D array of probabilities for each i-th cell, i.e. for each i-th row (or column) of the squared pairwise Euclidean distance matrix (distances from the i-th cell to all other cells in the data set). This is done because of $\sigma_i$ in the exponent in Eq. (1): we have to use binary search in order to find $\sigma_i$ for each i-th cell in the data set. Next, for each i-th cell, given the 1D array of high-dimensional probabilities, we compute the entropy of the array and hence the Perplexity according to the definition in Eq. (2). So we have a function that produces the perplexity (scalar) for each $\sigma_i$ for each i-th cell. Now we can fix the perplexity value (the desired perplexity, aka hyperparameter) and feed this function into the binary search procedure below in order to compute $\sigma_i$ for each i-th cell:
def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    """
    Solve equation perp_of_sigma(sigma) = fixed_perplexity
    with respect to sigma by the binary search algorithm
    """
    sigma_lower_limit = 0
    sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma
Finally, we can define a lambda perplexity function of $\sigma_i$, fix the desired perplexity to e.g. 30 and run the binary search for each i-th cell. This gives us $\sigma_i$; substituting it back into Eq. (1) allows calculating the matrix of probabilities of pairwise Euclidean distances in the high-dimensional space. Technically, for each $\sigma_i$ we get a 1D array of probabilities; computing $\sigma_i$ for each cell we get n 1D arrays of probabilities, which together build the matrix of high-dimensional probabilities:
PERPLEXITY = 30
prob = np.zeros((n,n))
sigma_array = []
for dist_row in range(n):
    func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
    binary_search_result = sigma_binary_search(func, PERPLEXITY)
    prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
    sigma_array.append(binary_search_result)
    if (dist_row + 1) % 100 == 0:
        print("Sigma binary search finished {0} of {1} cells".format(dist_row + 1, n))
print("\nMean sigma = " + str(np.mean(sigma_array)))
np.array(sigma_array)[0:20]
prob[0:4, 0:4]
The last thing we do is apply the symmetry condition from Eq. (1). Now we are all done with Eqs. (1, 2).
P = prob + np.transpose(prob)
#P = (prob + np.transpose(prob)) / (2*n)
#P = P / np.sum(P)
The major part of the coding is over; programming the low-dimensional layout / embedding from Eq. (3) is really simple. Note that the matrix of low-dimensional probabilities, Eq. (3), accepts the coordinates Y of the low-dimensional embedding as a parameter. This means that the Y coordinates will be optimized later via the KL-divergence.
def prob_low_dim(Y):
    """
    Compute matrix of probabilities q_ij in low-dimensional space
    """
    inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    np.fill_diagonal(inv_distances, 0.)
    return inv_distances / np.sum(inv_distances, axis = 1, keepdims = True)
Actually, we are not going to use the definition of the KL-divergence from Eq. (4) in the code; we only need the gradient of the KL-divergence with respect to the coordinates Y of the low-dimensional embedding. However, it is still good to calculate the KL-divergence at each iteration of the Gradient Descent in order to see that it really decreases during the optimization.
def KL(P, Y):
    """
    Compute KL-divergence from matrix of high-dimensional probabilities
    and coordinates of low-dimensional embeddings
    """
    Q = prob_low_dim(Y)
    # 0.01 regularizes the logarithms against zero probabilities
    return P * np.log(P + 0.01) - P * np.log(Q + 0.01)
Computing the gradient of KL-divergence from Eq. (4) is a bit tricky and I am using the method from the blog of Liam Schoneveld:
def KL_gradient(P, Y):
    """
    Compute gradient of KL-divergence
    """
    Q = prob_low_dim(Y)
    y_diff = np.expand_dims(Y, 1) - np.expand_dims(Y, 0)
    inv_dist = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    return 4*np.sum(np.expand_dims(P - Q, 2) * y_diff * np.expand_dims(inv_dist, 2), axis = 1)
Now everything is ready for running the Gradient Descent optimization. We will initialize the coordinates Y of the low-dimensional embedding from a Normal distribution and update Y via the Gradient Descent rule. The KL-divergence will be monitored along the way to confirm that it actually decreases.
N_LOW_DIMS = 2
LEARNING_RATE = 0.6
MAX_ITER = 200

np.random.seed(12345)
y = np.random.normal(loc = 0, scale = 1, size = (n, N_LOW_DIMS))

KL_array = []
print("Running Gradient Descent: \n")
for i in range(MAX_ITER):
    y = y - LEARNING_RATE * KL_gradient(P, y)

    plt.figure(figsize=(20,15))
    plt.scatter(y[:,0], y[:,1], c=y_train.astype(int), cmap = 'tab10', s = 50)
    plt.title("tSNE on Cancer Associated Fibroblasts (CAFs): Programmed from Scratch", fontsize = 20)
    plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
    plt.savefig('tSNE_Plots/tSNE_iter_' + str(i) + '.png')
    plt.close()

    KL_array.append(np.sum(KL(P, y)))
    if i % 10 == 0:
        print("KL divergence = " + str(np.sum(KL(P, y))))
plt.figure(figsize=(20,15))
plt.plot(KL_array)
plt.title("KL-divergence", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE", fontsize = 20)
plt.show()
It looks like the KL-divergence is decreasing; now it is time to display the final coordinates Y of the low-dimensional embedding:
plt.figure(figsize=(20,15))
plt.scatter(y[:,0], y[:,1], c=y_train.astype(int), cmap = 'tab10', s = 50)
plt.title("tSNE on Cancer Associated Fibroblasts (CAFs): Programmed from Scratch", fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
The clusters are quite visible, although somewhat scattered. Let us watch an animation of how the points group into the clusters:
#%%bash
#convert -delay 0 $(for i in $(seq 0 1 20; seq 21 10 199); do echo tSNE_iter_${i}.png; done) \
#-loop 0 tSNE_animated.gif
from IPython.display import Image
Image('/home/nikolay/Documents/Jupyter_Notebooks/tSNE_Plots/tSNE_animated.gif.png', width=2000)
For comparison, we will also run the scikit-learn version of tSNE in order to show how a typical tSNE plot for CAFs would look. Please note that this is the default Barnes-Hut implementation, meaning that the high-dimensional probabilities are computed only for the nearest neighbors, and the number of nearest neighbors is equal to 3*perplexity. In addition, an early exaggeration parameter equal to 12 is applied. All these small details actually make a big difference in both the speed and the visualization quality of tSNE.
from sklearn.manifold import TSNE
model = TSNE(learning_rate = 100, n_components = 2, random_state = 123, perplexity = 30, verbose=2)
tsne = model.fit_transform(X_train)
plt.figure(figsize=(20, 15))
plt.scatter(tsne[:, 0], tsne[:, 1], c=y_train.astype(int), cmap = 'tab10', s = 50)
plt.title("tSNE on Cancer Associated Fibroblasts (CAFs): Burnes-Hut with Early Exaggeration", fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
Now, in order to reproduce the from-scratch implementation, we will use exact tSNE and switch off the early exaggeration.
from sklearn.manifold import TSNE
model = TSNE(learning_rate = 100, n_components = 2,
random_state = 123, perplexity = 30, early_exaggeration = 1, method = 'exact', verbose = 2)
tsne = model.fit_transform(X_train)
plt.figure(figsize=(20, 15))
plt.scatter(tsne[:, 0], tsne[:, 1], c=y_train.astype(int), cmap = 'tab10', s = 50)
plt.title("tSNE on Cancer Associated Fibroblasts (CAFs): Exact without Early Exaggeration", fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
Now the scattering of the clusters becomes more obvious; this resembles the picture from our from-scratch implementation. Please note that the mean sigma was 7.54 here, while in our implementation it was 6.37. This is an interesting discrepancy which should be clarified in the future.
My first impression when I heard about UMAP was that it was a new and interesting dimensionality reduction method based on solid mathematical principles, and hence very different from tSNE, which is a purely Machine Learning, semi-empirical algorithm. My biologist friends told me the original UMAP paper was "too mathematical", and looking at Section 2 of the paper I was very happy to see strict and accurate mathematics coming to Life and Data Science. However, reading the UMAP docs and watching Leland McInnes' talk at SciPy 2018, I got puzzled and felt like UMAP was just another neighbor graph technique, so similar to tSNE that I struggled to understand how exactly UMAP is different from tSNE.
From the UMAP paper, the differences between UMAP and tSNE are not very visible, even though Leland McInnes tries to summarize them in Appendix C. I would rather say that I do see small differences, but it is not immediately clear why they bring such dramatic effects. Here I will first summarize what I noticed is different between UMAP and tSNE, and then try to explain why these differences are important and figure out how large their effects are. One difference that is easy to see is the low-dimensional probability: instead of the fixed Student t-distribution of tSNE, UMAP uses the family of curves

$$ Q(Y) = \frac{1}{1 + a Y^{2b}} \hspace{2em} (5) $$

Let us plot this family for a few combinations of the parameters a and b:
plt.figure(figsize=(20, 15))
y = np.linspace(0, 10, 1000)
my_prob = lambda y, a, b: np.power(1 + a*y**(2*b), -1)
plt.plot(y, my_prob(y, a = 1, b = 1))
plt.plot(y, my_prob(y, a = 1.93, b = 0.79))
plt.plot(y, my_prob(y, a = 1.93, b = 5))
plt.gca().legend(('a = 1, b = 1', 'a = 1.93, b = 0.79', 'a = 1.93, b = 5'), fontsize = 20)
plt.title("Low-Dimensional Probability of Pairwise Euclidean Distances", fontsize = 20)
plt.xlabel("Y", fontsize = 20); plt.ylabel("Q(Y)", fontsize = 20)
plt.show()
We can see that the family of curves is very sensitive to the parameter b: at large b it forms a sort of plateau at small Y. This implies that below the UMAP hyperparameter min_dist all data points are equally tightly connected. Since the Q(Y) function then behaves almost like a Heaviside step function, UMAP assigns almost the same low-dimensional coordinate to all points that are close to each other in the low-dimensional space. The min_dist parameter is exactly what leads to the super-tightly packed clusters often observed in UMAP dimensionality reduction plots.
To demonstrate what the fit from Eq. (5) looks like, let us define a simple piecewise function (where the plateau is set by the min_dist parameter) and fit it with the family of functions $1/(1+ay^{2b})$ via optimize.curve_fit from the scipy Python library. As a result of the fit, we obtain the optimal a and b parameters for the function $1/(1+ay^{2b})$:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np

MIN_DIST = 1

x = np.linspace(0, 10, 300)

def f(x, min_dist):
    y = []
    for i in range(len(x)):
        if(x[i] <= min_dist):
            y.append(1)
        else:
            y.append(np.exp(- x[i] + min_dist))
    return y

dist_low_dim = lambda x, a, b: 1 / (1 + a*x**(2*b))

p , _ = optimize.curve_fit(dist_low_dim, x, f(x, MIN_DIST))
print(p)
plt.figure(figsize=(20,15))
plt.plot(x, f(x, MIN_DIST), "o")
plt.plot(x, dist_low_dim(x, p[0], p[1]), c = "red")
plt.title("Non-Linear Least Square Fit of Piecewise Function", fontsize = 20)
plt.gca().legend(('Original', 'Fit'), fontsize = 20)
plt.xlabel("X", fontsize = 20)
plt.ylabel("Y", fontsize = 20)
plt.show()
Since we need the gradient of the cross-entropy in order to implement the Gradient Descent, let us quickly calculate it. Ignoring the constant terms that contain only $p_{ij}(X)$, we can rewrite the cross-entropy and differentiate it as follows:
$$ CE(X, d_{ij}) = \sum_j \left[ -P(X)\log Q(d_{ij}) - \left(1-P(X)\right)\log\left(1 - Q(d_{ij})\right)\right], \hspace{2em}\text{where } d_{ij} = || y_i - y_j || $$

$$ Q(d_{ij}) = \frac{1}{1 + a d_{ij}^{2b}}; \hspace{2em} 1 - Q(d_{ij}) = \frac{a d_{ij}^{2b}}{1 + a d_{ij}^{2b}}; \hspace{2em} \frac{\delta Q}{\delta d_{ij}} = -\frac{2ab d_{ij}^{2b-1}}{\left(1 + a d_{ij}^{2b}\right)^2} $$

$$ \frac{\delta CE}{\delta y_i} = \sum_j \left[ -\frac{P(X)}{Q(d_{ij})} + \frac{1 - P(X)}{1 - Q(d_{ij})} \right]\frac{\delta Q}{\delta d_{ij}}\frac{\delta d_{ij}}{\delta y_i} = \sum_j \left[ -P(X)\left(1 + a d_{ij}^{2b}\right) + \frac{\left(1-P(X)\right)\left(1 + a d_{ij}^{2b}\right)}{a d_{ij}^{2b}} \right]\frac{\delta Q}{\delta d_{ij}}\frac{y_i - y_j}{d_{ij}} $$

$$ \frac{\delta CE}{\delta y_i} = \sum_j \left[ \frac{2abd_{ij}^{2(b-1)}P(X)}{1 + a d_{ij}^{2b}} - \frac{2b\left(1-P(X)\right)}{d_{ij}^2\left(1 + a d_{ij}^{2b}\right)} \right]\left(y_i - y_j\right) \hspace{2em} (6) $$

Graph Laplacian, Spectral Clustering, Laplacian Eigenmaps, Diffusion Maps, Spectral Embedding, etc. all refer to practically the same interesting methodology that combines the Matrix Factorization and Neighbor Graph approaches to the Dimensionality Reduction problem. In this methodology, we start by constructing a graph (or knn-graph) and formalizing it with matrix algebra (adjacency and degree matrices) via constructing the Laplacian matrix; finally, we factorize the Laplacian matrix, i.e. solve the eigen-decomposition problem:
$$ L = D^{-1/2}\left(D - A\right)D^{-1/2} $$

We can use the scikit-learn Python library to easily display the initial low-dimensional coordinates via the SpectralEmbedding function:
from sklearn.manifold import SpectralEmbedding
model = SpectralEmbedding(n_components = 2, n_neighbors = 50)
# X_train was already log-transformed above
se = model.fit_transform(X_train)
plt.figure(figsize=(20,15))
plt.scatter(se[:, 0], se[:, 1], c = y_train.astype(int), cmap = 'tab10', s = 50)
plt.title('Laplacian Eigenmap', fontsize = 20)
plt.xlabel("LAP1", fontsize = 20)
plt.ylabel("LAP2", fontsize = 20)
plt.show()
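To connect the plot to the formula above, here is a minimal sketch of Laplacian Eigenmaps by hand; note this is an illustration under simplifying assumptions (a plain knn connectivity graph and a dense eigen-decomposition), not the exact scikit-learn implementation:

import numpy as np
from sklearn.neighbors import kneighbors_graph

# Symmetrized knn adjacency matrix (using the same 50 neighbors as above)
A = kneighbors_graph(X_train, n_neighbors = 50).toarray()
A = np.maximum(A, A.T)

# Degree matrix and symmetric normalized Laplacian L = D^-1/2 (D - A) D^-1/2
D = np.diag(A.sum(axis = 1))
D_inv_sqrt = np.diag(1 / np.sqrt(A.sum(axis = 1)))
L = D_inv_sqrt @ (D - A) @ D_inv_sqrt

# Eigenvectors with the smallest non-zero eigenvalues give the embedding
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, 1:3]   # skip the trivial constant eigenvector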
Now let us briefly discuss why exactly they say that tSNE preserves only local structure of the data. Locality of tSNE can be understood from different points of view. First we have the $\sigma$ parameter in Eq. (1) which sets how locally our data points "feel" each other. Since the probability of pairwise Euclidean distances decays exponentially, at small values of $\sigma$, it is basically zero for distant points (large X) and grows dramatically fast only for nearest neighbors (small X). In contrast, at large $\sigma$, the probabilities for distant and close points become comparable and in the limit $\sigma\rightarrow \infty$, the probability becomes equal to 1 for all distances between any pair of points, i.e. points become equidistant.
plt.figure(figsize=(20, 15))
x = np.linspace(0, 10, 1000)
sigma_list = [0.1, 1, 5, 10]
for sigma in sigma_list:
    my_prob = lambda x: np.exp(-x**2 / (2*sigma**2))
    plt.plot(x, my_prob(x))
plt.gca().legend((r'$\sigma$ = 0.1', r'$\sigma$ = 1', r'$\sigma$ = 5', r'$\sigma$ = 10'), fontsize = 20)
plt.title("High-Dimensional Probability of Pairwise Euclidean Distances", fontsize = 20)
plt.xlabel("X", fontsize = 20); plt.ylabel("P(X)", fontsize = 20)
plt.show()
Interestingly, if we expand the high-dimensional probability of pairwise Euclidean distances into a Taylor series at $\sigma\rightarrow \infty$, we get a power law in the second-order approximation:
$$ P(X) = \displaystyle e^{\displaystyle -\frac{X^2}{2\sigma^2}} $$

$$ P(X) \underset{\displaystyle\sigma\rightarrow \infty}{\longrightarrow} 1 - \frac{X^2}{2\sigma^2} + \frac{X^4}{8\sigma^4} - \dots \hspace{2em} (7) $$

This power law with respect to the pairwise Euclidean distances resembles the cost function of Multi-Dimensional Scaling (MDS), which is known to preserve global distances by trying to preserve the distance between each pair of points regardless of whether they are far apart or close to each other. One can interpret this as saying that at large $\sigma$ tSNE does account for long-range interactions between the points, so it is not entirely correct to say that tSNE can handle only local distances. However, we usually restrict ourselves to finite values of perplexity; Laurens van der Maaten recommends perplexity values between 5 and 50, although perhaps a good compromise between local and global information would be to select the perplexity approximately following the square root law $\approx N^{1/2}$, where N is the sample size.
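For example (a sketch of the heuristic, not an official recommendation), for the data set loaded above the square-root law would suggest:

# Hypothetical rule of thumb: perplexity on the order of sqrt(sample size)
suggested_perplexity = int(np.sqrt(n))
print("Square-root-law perplexity for this data set: " + str(suggested_perplexity))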
On the other hand, $\sigma\rightarrow 0$ results in extremely "local" behavior of the high-dimensional probability, which resembles the Dirac delta-function:
$$ P(X) \underset{\displaystyle\sigma\rightarrow 0}{\longrightarrow} \delta_{\sigma}(X) \hspace{2em} (8) $$

Another way to understand the "locality" of tSNE is to think about the KL-divergence function. Let us try to plot it, assuming X is a distance between points in the high-dimensional space and Y is a low-dimensional distance:
$$ P(X) \approx e^{\displaystyle - X^2} \hspace{2em} Q(Y) \approx \frac{1}{1+Y^2} $$

From the definition of the KL-divergence, Eq. (4):
$$ KL(X,Y) = P(X)\log\left(\frac{P(X)}{Q(Y)}\right) = P(X)\log P(X) - P(X)\log Q(Y) \hspace{2em} (9) $$The first term in Eq. (9) is close to zero for both large and small X. It goes to zero for small X since the exponent becomes close to 1 and $\log (1) = 0$. For large X this term still goes to zero because the exponential pre-factor goes faster to zero than the logarith goes to $- \infty$. Therefore, for intuitive understanding of the KL-divergence it is enough to to consider only the second term:
$$ KL(X,Y) \approx - P(X)\log Q(Y) = e^{\displaystyle - X^2} \log \left(1 + Y^2\right) $$

This is a weird-looking function; let us plot KL(X, Y):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter
fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(projection = '3d')   # fig.gca(projection=...) is removed in recent matplotlib
# Set rotation angle to 30 degrees
ax.view_init(azim=30)
X = np.arange(0, 3, 0.1)
Y = np.arange(0, 3, 0.1)
X, Y = np.meshgrid(X, Y)
Z = np.exp(-X**2)*np.log(1 + Y**2)
# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,linewidth=0, antialiased=False)
ax.set_xlabel('X', fontsize = 20)
ax.set_ylabel('Y', fontsize = 20)
ax.set_zlabel('KL (X, Y)', fontsize = 20)
ax.set_zlim(0, 2.2)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))
# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
The function has a very asymmetric shape. If the distance between the points in high dimensions X is small, the exponential prefactor is 1 and the logarithmic term behaves as $\log\left(1 + Y^2\right)$, meaning that if the distance in low dimensions Y is large, there will be a high penalty; therefore tSNE tries to reduce Y at small X in order to reduce the penalty. In contrast, for large distances X in high dimensions, Y can be basically any value from 0 to $\infty$, since the exponential term goes to zero and always wins over the logarithmic term. Therefore, points far apart in high dimensions may end up close to each other in low dimensions. In other words, tSNE does not guarantee that points far apart in high dimensions will remain far apart in low dimensions. However, it does guarantee that points close to each other in high dimensions will remain close in low dimensions. So tSNE is not really good at projecting large distances into low dimensions: it preserves only the local data structure, provided that $\sigma$ does not go to $\infty$.
In contrast to tSNE, UMAP uses Cross-Entropy (CE) as a cost function instead of the KL-divergence:
$$ CE(X,Y) = P(X)\log\left(\frac{P(X)}{Q(Y)}\right) + \left(1-P(X)\right)\log\left(\frac{1-P(X)}{1-Q(Y)}\right) $$

$$ CE(X, Y) = e^{\displaystyle - X^2} \log \left[e^{\displaystyle - X^2}\left(1 + Y^2\right)\right] + \left(1 - e^{\displaystyle - X^2}\right) \log \left[\frac{\left(1-e^{\displaystyle - X^2}\right)\left(1 + Y^2\right)}{Y^2}\right] \approx e^{\displaystyle - X^2} \log \left(1 + Y^2\right) + \left(1-e^{\displaystyle - X^2}\right) \log \left(\frac{1 + Y^2}{Y^2}\right) $$

This leads to dramatic changes in the balance between local and global structure preservation. At small values of X we get the same limit as for tSNE, since the second term disappears because of its prefactor and the fact that the log function grows more slowly than any polynomial:
$$ X\rightarrow 0: CE(X, Y) \approx \log \left(1 + Y^2\right) $$therefore the Y coordinates are forced to be samll, i.e. $Y\rightarrow 0$, in order to minimize the penalty. This is exactly like tSNE behaves.
However, in the opposite limit of large X, i.e. $X\rightarrow\infty$, the first term disappears, the prefactor of the second term becomes 1 and we obtain:
$$ X\rightarrow \infty: CE(X, Y) \approx \log \left(\frac{1 + Y^2}{Y^2}\right) $$

Here, if Y is small, we get a high penalty because of the $Y^2$ in the denominator of the logarithm; therefore Y is encouraged to be large so that the ratio under the logarithm approaches 1 and we get zero penalty. Hence we get $Y\rightarrow \infty$ at $X\rightarrow \infty$, so the global distances are preserved when moving from the high- to the low-dimensional space, exactly what we want.
To demonstrate this, let us plot the UMAP CE cost function:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter
fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(projection = '3d')   # fig.gca(projection=...) is removed in recent matplotlib
# Set rotation angle to 30 degrees
ax.view_init(azim=30)
X = np.arange(0, 3, 0.001)
Y = np.arange(0, 3, 0.001)
X, Y = np.meshgrid(X, Y)
Z = np.exp(-X**2)*np.log(1 + Y**2) + (1 - np.exp(-X**2))*np.log((1 + Y**2) / (Y**2+0.01))
# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,linewidth=0, antialiased=False)
ax.set_xlabel('X', fontsize = 20)
ax.set_ylabel('Y', fontsize = 20)
ax.set_zlabel('CE (X, Y)', fontsize = 20)
ax.set_zlim(0, 4.3)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))
# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
Here we see that the "right" part of the plot looks fairly similar to the KL-divergence surface above. Meaning that at low X we still want to have low Y in order to reduce the penalty. However, at large X the Y distance really wants to be large as well because if it is small, the CE (X, Y) penalty will be enormous. Remember, previously, for KL (X, Y) surface, we did not have any difference in penalty between low and high Y at large X. That is why CE (X, Y) cost function is capable of preserving global distances as well as local distances.
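A quick numerical check of this asymmetry, using the same approximate P(X) and Q(Y) as above (the 0.01 is just a regularizer against division by zero):

import numpy as np

def KL_penalty(X, Y):
    return np.exp(-X**2) * np.log(1 + Y**2)

def CE_penalty(X, Y):
    return (np.exp(-X**2) * np.log(1 + Y**2)
            + (1 - np.exp(-X**2)) * np.log((1 + Y**2) / (Y**2 + 0.01)))

# Large high-dimensional distance (X = 3), small vs large low-dimensional distance
print(KL_penalty(3, 0.1), KL_penalty(3, 3))   # both ~0: KL does not care where Y lands
print(CE_penalty(3, 0.1), CE_penalty(3, 3))   # CE heavily penalizes the small Y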
Now that we have learned how to program tSNE from scratch and understood the key differences between tSNE and UMAP, let us take the next step and try to implement the UMAP algorithm from scratch by modifying the tSNE code.
import numpy as np
import pandas as pd
from scipy import optimize
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics.pairwise import euclidean_distances
path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
dist = np.square(euclidean_distances(X_train, X_train))
# rho[i] = squared distance from cell i to its nearest neighbor
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
print(dist[0:4, 0:4])
print('\n')
print(rho[0:4])
def prob_high_dim(sigma, dist_row):
    """
    For each row of Euclidean distance matrix (dist_row) compute
    probability in high dimensions (1D array)
    """
    # Local connectivity: subtract the distance to the nearest neighbor;
    # note there is no normalization, in contrast to tSNE
    d = dist[dist_row] - rho[dist_row]
    d[d < 0] = 0
    return np.exp(- d / sigma)
def k(prob):
    """
    Compute n_neighbor = k (scalar) for each 1D array of high-dimensional probability
    """
    return np.power(2, np.sum(prob))
def sigma_binary_search(k_of_sigma, fixed_k):
    """
    Solve equation k_of_sigma(sigma) = fixed_k
    with respect to sigma by the binary search algorithm
    """
    sigma_lower_limit = 0
    sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if k_of_sigma(approx_sigma) < fixed_k:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma
N_NEIGHBOR = 15
prob = np.zeros((n,n))
sigma_array = []
for dist_row in range(n):
    func = lambda sigma: k(prob_high_dim(sigma, dist_row))
    binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
    prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
    sigma_array.append(binary_search_result)
    if (dist_row + 1) % 100 == 0:
        print("Sigma binary search finished {0} of {1} cells".format(dist_row + 1, n))
print("\nMean sigma = " + str(np.mean(sigma_array)))
# UMAP symmetrization (fuzzy set union with the Hadamard product), commented out:
#P = prob + np.transpose(prob) - np.multiply(prob, np.transpose(prob))
# Simple tSNE-style averaging used here instead:
P = (prob + np.transpose(prob)) / 2
MIN_DIST = 0.25

x = np.linspace(0, 3, 300)

def f(x, min_dist):
    y = []
    for i in range(len(x)):
        if(x[i] <= min_dist):
            y.append(1)
        else:
            y.append(np.exp(- x[i] + min_dist))
    return y

dist_low_dim = lambda x, a, b: 1 / (1 + a*x**(2*b))

p , _ = optimize.curve_fit(dist_low_dim, x, f(x, MIN_DIST))
a = p[0]
b = p[1]
print("Hyperparameters a = " + str(a) + " and b = " + str(b))
# Override the fitted values: a = b = 1 reduces Q(Y) to the Student t-distribution,
# which turned out to give tighter clusters here (see the discussion below)
a = 1
b = 1
def prob_low_dim(Y):
    """
    Compute matrix of probabilities q_ij in low-dimensional space
    """
    inv_distances = np.power(1 + a * np.square(euclidean_distances(Y, Y))**b, -1)
    return inv_distances
def CE(P, Y):
    """
    Compute Cross-Entropy (CE) from matrix of high-dimensional probabilities
    and coordinates of low-dimensional embeddings
    """
    Q = prob_low_dim(Y)
    # 0.01 regularizes the logarithms against zero probabilities
    return - P * np.log(Q + 0.01) - (1 - P) * np.log(1 - Q + 0.01)
def CE_gradient(P, Y):
    """
    Compute the gradient of Cross-Entropy (CE), Eq. (6)
    """
    y_diff = np.expand_dims(Y, 1) - np.expand_dims(Y, 0)
    inv_dist = np.power(1 + a * np.square(euclidean_distances(Y, Y))**b, -1)
    # Repulsive part of the gradient, row-normalized; the small constants
    # guard against division by zero at coinciding points
    Q = np.dot(1 - P, np.power(0.001 + np.square(euclidean_distances(Y, Y)), -1))
    np.fill_diagonal(Q, 0)
    Q = Q / np.sum(Q, axis = 1, keepdims = True)
    fact = np.expand_dims(a * P * (1e-8 + np.square(euclidean_distances(Y, Y)))**(b-1) - Q, 2)
    return 2 * b * np.sum(fact * y_diff * np.expand_dims(inv_dist, 2), axis = 1)
N_LOW_DIMS = 2
LEARNING_RATE = 1
MAX_ITER = 200

np.random.seed(12345)
# Initialize the low-dimensional coordinates with a Laplacian Eigenmap;
# X_train was already log-transformed above
model = SpectralEmbedding(n_components = N_LOW_DIMS, n_neighbors = 50)
y = model.fit_transform(X_train)
#y = np.random.normal(loc = 0, scale = 1, size = (n, N_LOW_DIMS))

CE_array = []
print("Running Gradient Descent: \n")
for i in range(MAX_ITER):
    y = y - LEARNING_RATE * CE_gradient(P, y)

    plt.figure(figsize=(20,15))
    plt.scatter(y[:,0], y[:,1], c = y_train.astype(int), cmap = 'tab10', s = 50)
    plt.title("UMAP on Cancer Associated Fibroblasts (CAFs): Programmed from Scratch", fontsize = 20)
    plt.xlabel("UMAP1", fontsize = 20); plt.ylabel("UMAP2", fontsize = 20)
    plt.savefig('UMAP_Plots/UMAP_iter_' + str(i) + '.png')
    plt.close()

    CE_current = np.sum(CE(P, y)) / 1e+5
    CE_array.append(CE_current)
    if i % 10 == 0:
        print("Cross-Entropy = " + str(CE_current) + " after " + str(i) + " iterations")
plt.figure(figsize=(20,15))
plt.plot(CE_array)
plt.title("Cross-Entropy", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("CROSS-ENTROPY", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.scatter(y[:,0], y[:,1], c = y_train.astype(int), cmap = 'tab10', s = 50)
plt.title("UMAP on Cancer Associated Fibroblasts (CAFs): Programmed from Scratch", fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20); plt.ylabel("UMAP2", fontsize = 20)
plt.show()
We can see distinct and tight clusters, similar to what a typical UMAP plot looks like. Moreover, these cell populations are now super-easy to cluster with any clustering algorithm, preferably graph-based or density-based, though even K-means or a Gaussian Mixture Model (GMM) would do a good job on such a clean dimension reduction plot. Let us animate the optimization in order to see how the points group into the clusters:
#%%bash
#convert -delay 0 $(for i in $(seq 0 1 20; seq 21 10 199); do echo UMAP_iter_${i}.png; done) \
#-loop 0 UMAP_animated.gif
from IPython.display import Image
Image('/home/nikolay/Documents/Jupyter_Notebooks/UMAP_Plots/UMAP_animated.gif.png', width=2000)
For comparison, let us use the original Leland McInnes Python implementation of UMAP with identical hyperparameters in order to compare it with our "from scratch" UMAP implementation:
from umap import UMAP
plt.figure(figsize=(20,15))
model = UMAP(n_neighbors = 15, min_dist = 0.25, n_components = 2, verbose = True)
umap = model.fit_transform(X_train)
plt.scatter(umap[:, 0], umap[:, 1], c = y_train.astype(int), cmap = 'tab10', s = 50)
plt.title('UMAP', fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20)
plt.ylabel("UMAP2", fontsize = 20)
plt.show()
We conclude that the original UMAP implementation from Leland McInnes looks quite similar to our "from scratch" implementation. If anything, the clusters in our dimension reduction look even more distinct. This slight "outperformance" was achieved by setting a = 1 and b = 1, as for the Student t-distribution, and by using the tSNE symmetrization, Eq. (1), rather than the symmetrization with the Hadamard product from the UMAP algorithm.
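To make the difference between the two symmetrizations explicit, here is a toy comparison on a hypothetical 2 x 2 matrix of high-dimensional probabilities:

import numpy as np

prob = np.array([[0.0, 0.9],
                 [0.2, 0.0]])   # toy asymmetric high-dimensional probabilities

# UMAP symmetrization: fuzzy set union (probabilistic t-conorm),
# built from the element-wise (Hadamard) product
P_umap = prob + prob.T - prob * prob.T

# tSNE-style symmetrization: simple averaging
P_tsne = (prob + prob.T) / 2

print(P_umap)   # off-diagonal: 0.9 + 0.2 - 0.18 = 0.92
print(P_tsne)   # off-diagonal: (0.9 + 0.2) / 2 = 0.55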
My prediction is that UMAP is just the beginning; in the future there will be more and better dimension reduction techniques, because it is quite straightforward to tune the low- and high-dimensional distributions, make better normalizations, design better cost functions and play with the attractive/repulsive forces of the N-body problem in order to get even better low-dimensional representations. I expect an avalanche of such techniques in the Single Cell research area, because even I can improve the UMAP low-dimensional representation by making small modifications. If I can do it, many people will be able to do it, so there are more dimension reduction techniques to come.
We know that UMAP is faster than tSNE when it concerns a) a large number of samples, b) a number of embedding dimensions greater than 2-3, c) a large number of ambient dimensions in the original data set. Let us try to understand how the superiority of UMAP over tSNE follows from the math and the algorithmic implementation.
Both tSNE and UMAP essentially consist of two steps:

1. Building a graph in high dimensions and determining the bandwidth of the exponential probability, $\sigma$, using binary search and a fixed number of nearest neighbors to consider.
2. Optimizing the low-dimensional embeddings via Gradient Descent. The second step is the bottleneck of the algorithm; it is sequential and cannot be multi-threaded. Since both tSNE and UMAP perform the second step, it is not obvious why UMAP can do it more efficiently than tSNE.
However, I noticed that the first step became much faster for UMAP than it was for tSNE. This is because of two reasons.
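One ingredient of the step-1 speedup that is easy to demonstrate (my illustration; not an exhaustive list of the reasons): the high-dimensional probabilities are needed only for the k nearest neighbors of each cell, so a knn query can replace the full n x n distance matrix. A sketch using sklearn.neighbors.NearestNeighbors, with a placeholder sigma instead of the binary search:

import numpy as np
from sklearn.neighbors import NearestNeighbors

N_NEIGHBOR = 15

# Query only the k nearest neighbors per cell instead of all n*n pairwise distances
nn = NearestNeighbors(n_neighbors = N_NEIGHBOR + 1).fit(X_train)
knn_dist, knn_idx = nn.kneighbors(X_train)   # shapes (n, k+1); column 0 is the point itself

# Sparse high-dimensional probabilities: one small row per cell
rho = np.square(knn_dist[:, 1])              # squared distance to the nearest neighbor
d = np.square(knn_dist[:, 1:]) - rho[:, None]
d[d < 0] = 0
prob_sparse = np.exp(-d / 1.0)               # sigma = 1.0 as a placeholder; binary search as before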
Next, UMAP actually becomes faster on the second step as well. This improvement also has a few reasons:
Here I will briefly comment on a few things about UMAP that are still unclear to me despite a few weeks of thinking about them. I would really appreciate it if at some point Leland McInnes or somebody else clarified them.